Arrests %>%
count(year) %>%
ggplot(aes(x = year, y = n)) +
geom_line() +
geom_point() +
labs(
title = "Yearly Arrests",
x = "Year", y = "Number of Arrests"
) +
theme_minimal()
observations:
- Arrests start high in ~2010, with counts over 3,500 per year.
- Arrest numbers remain relatively stable between ~2010 and 2016, fluctuating slightly between ~3,200–3,600.
- Beginning around 2017–2018, there is a noticeable decline in arrests.
- The drop becomes much sharper around 2019–2020, falling to nearly half of earlier levels — this likely corresponds with the COVID-19 pandemic and lockdowns, when many cities saw decreased police activity and fewer arrests.
- After 2020, there’s a small rebound, but the counts remain significantly lower than pre-2017 levels.
Arrests %>%
count(year, month_num) %>%
mutate(YearMonth = as.Date(paste(year, month_num, "01", sep = "-"))) %>%
ggplot(aes(x = YearMonth, y = n)) +
geom_line() +
labs(title = "Monthly Arrests Over Time", x = "Year-Month", y = "Number of Arrests") +
scale_x_date(date_breaks = "6 months", date_labels = "%Y-%m") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
theme_minimal()
Observations:
2009–2016: Arrests fluctuate between ~250–400 per month, with periodic peaks (approaching ~500) and valleys. There appears to be some seasonality: higher counts in some mid-year (likely summer) months, lower in others (likely winter).
2017–2018: Monthly counts start to decline gradually compared to earlier years. The peaks are less pronounced, staying mostly below 350.
2019–2020: Sharp decline begins late 2019 and early 2020 — arrests drop rapidly to below 200 per month. This drop aligns with the COVID-19 pandemic and lockdowns, when many jurisdictions reduced arrests.
2021–2025: Counts remain low (~100–150 per month) compared to pre-2019 levels. There are occasional small spikes, but they stay much lower than earlier years, suggesting a lasting reduction.
HEAT MAP
Arrests %>%
count(year, month) %>%
ggplot(aes(x = month, y = factor(year), fill = n)) +
geom_tile() +
scale_fill_viridis_c() +
labs(
title = "Heatmap of Arrests by Month and Year",
x = "Month", y = "Year", fill = "Arrests"
)+
theme_minimal()
Observations:
2010–2016:
Arrest levels are high overall (green to yellow hues) with a clear concentration in summer months (June–August) — particularly July & August, which show the most intense (yellow) spots.
Winter months (January–February, December) consistently show lower levels (blueish shades), which is expected due to seasonality.
2017–2018:
Arrests begin to decline slightly — the heatmap colors shift towards cooler greens and blues even in summer months, suggesting fewer arrests compared to earlier years.
2019–2020:
A dramatic drop appears starting in 2020 — most months, including summer, turn dark blue indicating very low arrest counts.
The drop is most pronounced in spring and summer 2020 — coinciding with COVID-19 restrictions.
2021–2024:
Arrest levels remain low (blue to dark blue), and the distinct summer peaks almost disappear.
The seasonality seems flattened — suggesting the usual summer increase has weakened post-pandemic.
Seasonal pattern:
Consistent before 2017:
- Peaks: June–August
- Lows: January–February & December
Post-2020, this pattern weakens.
Boxplot: Distribution of arrests by month
# Prepare data
monthly_counts <- Arrests %>%
count(year, month) %>%
mutate(
month_num = as.numeric(month),
# assign season
season = case_when(
month_num %in% c(12, 1, 2) ~ "Winter",
month_num %in% c(3, 4, 5) ~ "Spring",
month_num %in% c(6, 7, 8) ~ "Summer",
month_num %in% c(9, 10, 11) ~ "Fall"
)
) %>%
filter(!is.na(season)) %>%
mutate(season = droplevels(factor(season)))
# Define seasonal colors
season_colors <- c(
"Winter" = "skyblue3",
"Spring" = "springgreen3",
"Summer" = "indianred",
"Fall" = "gold"
)
# Plot
p <- ggplot(monthly_counts, aes(x = month, y = n, fill = season, color = season, group = month)) +
geom_boxplot() +
scale_y_continuous(breaks = seq(0, 500, 50)) +
scale_fill_manual(values = season_colors) +
scale_color_manual(values = season_colors) +
labs(
title = 'Distribution of Monthly Arrests by Month',
subtitle = 'Year: {current_frame}',
x = 'Month', y = 'Number of Arrests', fill = "Season", color = "Season"
) +
theme_minimal(base_size = 14) +
theme(
plot.subtitle = element_text(size = 18, face = "bold", hjust = 0.5)
)
# Animate with transition_manual
anim <- p +
transition_manual(year)
# save animation:
# animate(anim, width = 900, height = 600, fps = 1, duration = length(unique(monthly_counts$year)) * 4, renderer = gifski_renderer())
# anim_save("animated_boxplot_monthly_arrests_synced.gif")
General Trends:
The boxplots show clear seasonality:
Winter (Jan–Feb, Dec) — lowest arrests, consistently around ~200–250.
Spring (Mar–May) — arrests start increasing, reaching ~300–350.
Summer (Jun–Aug) — peaks, especially July & August, with arrests reaching ~400–500.
Fall (Sep–Nov) — begins to decline from summer peaks, stabilizing around ~300.
The seasonal pattern is most visible and consistent in the earlier years (2010–2016).
#checking why no spring and summer in 2020
# List of all years & months in your data
all_months <- month.abb
# All year-month combinations
all_combinations <- expand_grid(
year = unique(Arrests$year),
month = all_months
)
# Actual counts
actual_counts <- Arrests %>%
count(year, month)
# Convert both month columns to character explicitly
all_combinations <- all_combinations %>%
mutate(month = as.character(month))
actual_counts <- actual_counts %>%
mutate(month = as.character(month))
# Left join + fill missing with 0
complete_counts <- all_combinations %>%
left_join(actual_counts, by = c("year", "month")) %>%
mutate(n = replace_na(n, 0)) %>%
arrange(year, match(month, all_months))
# Show missing months
missing_months <- complete_counts %>% filter(n == 0)
print(missing_months)
## # A tibble: 20 × 3
## year month n
## <dbl> <chr> <int>
## 1 2021 Apr 0
## 2 2021 May 0
## 3 2021 Jun 0
## 4 2021 Jul 0
## 5 2021 Aug 0
## 6 2021 Sep 0
## 7 2024 May 0
## 8 2024 Jun 0
## 9 NA Jan 0
## 10 NA Feb 0
## 11 NA Mar 0
## 12 NA Apr 0
## 13 NA May 0
## 14 NA Jun 0
## 15 NA Jul 0
## 16 NA Aug 0
## 17 NA Sep 0
## 18 NA Oct 0
## 19 NA Nov 0
## 20 NA Dec 0
Seasonal barplot
Arrests %>%
count(year, season) %>%
filter(!is.na(season)) %>%
ggplot(aes(x = factor(year), y = n, fill = season)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = season_colors) +
theme_minimal()+
labs(
title = "Arrests by Season and Year",
x = "Year", y = "Number of Arrests", fill = "Season"
)
General Observations:
1. In all years, Summer shows the highest number of arrests, peaking around ~1,200 in 2010–2016.
2. Fall and Spring are similar to each other, generally slightly below Summer but still high.
3. Winter consistently has the lowest number of arrests each year — roughly ~50–70% of Summer.
Trend Over Time:
Arrest numbers declined steadily from ~2015 to 2019 across all seasons.
- Summer arrests dropped from ~1,100 in 2014 to ~600–700 by 2019.
- Fall and Spring show a similar downward trend.
- Winter remained the lowest but also declined.
Impact of COVID-19:
In 2020–2021, a dramatic drop is visible in all seasons:
Winter remains low.
Summer and Fall drop sharply, to about ~300 or lower — nearly ¼ of the pre-pandemic peak.
Spring also falls dramatically.
This aligns with what we observed in earlier plots: the pandemic significantly reduced arrests.
Post-2021 (pandemic ends in May 2023):
In 2022–2024, some recovery is observed:
- Slight uptick in Spring and Fall arrests.
- Summer remains much lower than its pre-2020 highs.
- Overall, the seasonal differences are less pronounced in recent years than before 2020.
Seasonal Patterns throughout the years:
Summer > Fall ≈ Spring > Winter
Calendar Heatmap
# Prepare data
daily_arrests <- Arrests %>%
filter(!is.na(ArrestDate)) %>%
count(ArrestDate) %>%
mutate(year = lubridate::year(ArrestDate))
# Get years
years <- sort(unique(daily_arrests$year))
# Loop: save PNG for each year
for (yr in years) {
cat("Rendering year:", yr, "\n")
p <- ggplot_calendar_heatmap(
daily_arrests %>% filter(year == yr),
'ArrestDate',
'n'
) +
scale_fill_gradient(low = "white", high = "red", name = "Arrests", na.value = "grey") +
labs(
title = sprintf("Calendar Heatmap of Daily Arrests — Year: %s", yr),
subtitle = "Darker colors indicate more arrests, grey = missing data"
) +
theme_minimal(base_size = 14)
# ggsave(sprintf("calendar_%s.png", yr), plot = p, width = 12, height = 8)
}
## Rendering year: 2010
## Rendering year: 2011
## Rendering year: 2012
## Rendering year: 2013
## Rendering year: 2014
## Rendering year: 2015
## Rendering year: 2016
## Rendering year: 2017
## Rendering year: 2018
## Rendering year: 2019
## Rendering year: 2020
## Rendering year: 2021
## Rendering year: 2022
## Rendering year: 2023
## Rendering year: 2024
imgs <- list.files(
path = "heat map calendar for each year",
pattern = "\\.png$",
full.names = TRUE
) |> sort()
length(imgs)
## [1] 0
output_gif <- "calendar_heatmap_animation.gif"
# gifski(
# png_files = imgs,
# gif_file = output_gif,
# delay = 2, # adjust speed (seconds per frame)
# width = 1200,
# height = 800,
# loop = TRUE
#)
Incidents <- incidents %>%
mutate(
Date = as.Date(Occur_Date), # adjust as needed
Month = month(Date, label = TRUE),
Year = year(Date)
) %>%
filter(!is.na(Latitude) & !is.na(Longitude)) # keep valid rows
# create output folder if it doesn't exist
if (!dir.exists("maps_2010")) dir.create("maps_2010")
# loop through months 1 to 12
for (m in 1:12) {
month_name <- month.abb[m] # "Jan", "Feb", etc.
freq_grid <- Arrests %>%
filter(year == 2010, month == month_name) %>%
mutate(
rlat = round(latitude * 200) / 200,
rlon = round(longitude * 200) / 200
) %>%
group_by(rlat, rlon) %>%
summarize(n = n(), .groups = "drop")
if (nrow(freq_grid) == 0) next
p <- ggplot() +
geom_tile(data = freq_grid, aes(x = rlon, y = rlat, fill = n)) +
geom_path(data = ch, aes(x = V1, y = V2), color = "black") +
coord_quickmap(xlim = c(-79.09, -78.99), ylim = c(35.86, 35.99)) +
labs(
title = "Arrest Frequency in Chapel Hill",
subtitle = sprintf("2010 - Month: %s", month_name),
fill = "Arrest Count",
x = "Longitude",
y = "Latitude"
) +
scale_fill_viridis_c() +
theme_minimal()
# ggsave(filename = sprintf("maps_2010/arrests_2010_%s.png", month_name),plot = p, width = 8, height = 6)
}
# get all PNGs recursively
pngs <- list.files(
path = "yearly_maps",
pattern = "\\.png$",
recursive = TRUE,
full.names = TRUE
)
# sort files by year and month
pngs_sorted <- pngs[order(pngs)] # assumes your files are named arrests_YYYY_MMM.png
# order chronologically
month_order <- setNames(1:12, month.abb)
pngs_sorted <- pngs %>%
tibble(path = .) %>%
mutate(
fname = basename(path),
year = as.numeric(stringr::str_extract(fname, "\\d{4}")),
month_str = stringr::str_extract(fname, paste(month.abb, collapse = "|")),
month_num = month_order[month_str]
) %>%
arrange(year, month_num) %>%
pull(path)
#gifski(
# png_files = pngs_sorted,
# gif_file = "all_years_arrests.gif",
# width = 1200,
# height = 800,
# delay = 2, # seconds per frame
# loop = TRUE
#)
# Interpreter: Sarah
Bazari
# filters just for chapel hill
chapel_hill_arrests = Arrests %>%
filter(City == "CHAPEL HILL")
# count arrests per zipcode
zip_counts = chapel_hill_arrests %>%
count(Zip, sort = TRUE)
zip_counts
## # A tibble: 5 × 2
## Zip n
## <dbl> <int>
## 1 27514 22774
## 2 27516 11808
## 3 27517 2705
## 4 27515 18
## 5 27599 4
# count most common charges per zipcode
offense_by_zip = chapel_hill_arrests %>%
count(Zip, Charge, sort = TRUE)
# see the top offenses per zip
offense_by_zip %>%
group_by(Zip) %>%
slice_max(n, n = 3) %>% # top 3 per zip
arrange(Zip, desc(n))
## # A tibble: 19 × 3
## # Groups: Zip [5]
## Zip Charge n
## <dbl> <chr> <int>
## 1 27514 IMPAIRED DRIVING DWI 2458
## 2 27514 FAIL TO APPEAR/COMPL 2263
## 3 27514 OPEN CONTAINER 1565
## 4 27515 INJURY TO TREES/LANDCAPE M 4
## 5 27515 ASSAULT ON FEMALE 2
## 6 27515 ASSAULT ON GOVERMENT OFFICIAL 2
## 7 27515 DRIVING WHILE IMPAIRED 2
## 8 27515 FAIL TO APPEAR/COMPL 2
## 9 27515 IMPAIRED DRIVING DWI 2
## 10 27515 LARCENY - ALL OTHER 2
## 11 27515 OPEN CONTAINER 2
## 12 27516 FAIL TO APPEAR/COMPL 1108
## 13 27516 OPEN CONTAINER 1001
## 14 27516 IMPAIRED DRIVING DWI 867
## 15 27517 IMPAIRED DRIVING DWI 369
## 16 27517 FAIL TO APPEAR/COMPL 258
## 17 27517 ASSAULT ON FEMALE 128
## 18 27599 DRUG PARAPHERNALIA 2
## 19 27599 POSS OR MANUFACTURE FRAUDULENT FORMS OF ID 2
# visualiztion bar plot
ggplot(zip_counts, aes(reorder(as.character(Zip), -n), y = n)) +
geom_col(fill = "#E63946", width = 0.7) + # Nice deep red tone
labs(
title = "Number of Arrests by Zip Code in Chapel Hill",
subtitle = "Zip codes with the highest arrest counts shown in descending order",
x = "Zip Code",
y = "Number of Arrests"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12, color = "gray40"),
axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
axis.text.y = element_text(size = 12),
axis.title = element_text(face = "bold"),
plot.margin = margin(10, 20, 10, 10)
)
# most common charges for each weapon type
charges_by_weapon = Arrests %>%
filter(!is.na(Weapon)) %>%
group_by(Weapon, Charge) %>%
summarize(Count = n(), .groups = "drop") %>%
arrange(Weapon, desc(Count))
head(charges_by_weapon, 10)
## # A tibble: 10 × 3
## Weapon Charge Count
## <chr> <chr> <int>
## 1 CLUB/BLACKJACK/BRASS KNUCKLES ASSAULT W DEADLY WEAPON 4
## 2 CLUB/BLACKJACK/BRASS KNUCKLES PHYSICAL SIMPLE ASSAULT-NON AGGRAVATED 4
## 3 CLUB/BLACKJACK/BRASS KNUCKLES ADW-OTHER WEAPON 3
## 4 CLUB/BLACKJACK/BRASS KNUCKLES CCW 3
## 5 CLUB/BLACKJACK/BRASS KNUCKLES ASLT INFL BODY INJURY F 2
## 6 CLUB/BLACKJACK/BRASS KNUCKLES ASSAULT & BATTERY 2
## 7 CLUB/BLACKJACK/BRASS KNUCKLES ASSAULT ON FEMALE 2
## 8 CLUB/BLACKJACK/BRASS KNUCKLES ASSAULT ON GOVERNMENT OFFICIAL 2
## 9 CLUB/BLACKJACK/BRASS KNUCKLES ASSAULT-SIMPLE 2
## 10 CLUB/BLACKJACK/BRASS KNUCKLES AWIK/SERIOUS INJURY 2
# weapon presence by race and gender
weapon_by_race_gender = Arrests %>%
filter(!is.na(Weapon), Gender %in% c("M", "F"), Race !="U") %>%
group_by(Weapon, Race, Gender) %>%
summarise(Count = n(), .groups = "drop") %>%
arrange(Weapon, desc(Count))
head(weapon_by_race_gender, 10)
## # A tibble: 10 × 4
## Weapon Race Gender Count
## <chr> <chr> <chr> <int>
## 1 CLUB/BLACKJACK/BRASS KNUCKLES B M 12
## 2 CLUB/BLACKJACK/BRASS KNUCKLES W M 7
## 3 CLUB/BLACKJACK/BRASS KNUCKLES B F 4
## 4 CLUB/BLACKJACK/BRASS KNUCKLES W F 4
## 5 CLUB/BLACKJACK/BRASS KNUCKLES H M 2
## 6 FIREARM (TYP NOT STATED) B M 23
## 7 FIREARM (TYP NOT STATED) B F 5
## 8 FIREARM (TYP NOT STATED) W M 2
## 9 HANDGUN B M 224
## 10 HANDGUN W M 70
# visual 'weapon_by_race_gender'
library(stringr)
weapon_by_race_gender$Weapon <- str_wrap(weapon_by_race_gender$Weapon, width = 15)
ggplot(weapon_by_race_gender, aes(x = Race, y = Count, fill = Gender)) +
geom_col(position = "dodge", width = 0.7) +
facet_wrap(~ Weapon, scales = "free_y", ncol = 3) +
scale_fill_manual(values = c("F" = "#f8766d", "M" = "#00bfc4")) +
labs(
title = "Weapons Involved in Arrests by Race and Gender",
subtitle = "Grouped by weapon type and separated by gender",
x = "Race",
y = "Number of Arrests",
fill = "Gender"
) +
theme_minimal(base_size = 12) +
theme(
strip.text = element_text(face = "bold", size = 9, angle = 0),
strip.text.x = element_text(margin = margin(b = 5)),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top"
)
# visual for most common charges for the top 3 weapon types
library(ggplot2)
top_charges = Arrests %>%
filter(!is.na(Charge)) %>%
count(Charge, sort = TRUE) %>%
slice_max(n, n = 15)
ggplot(top_charges, aes(x = reorder(Charge, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 15 Charges in Chapel Hill", x = "Charge", y = "Number of Arrests") +
theme_minimal()
# fix name of one of the weapons
data = Arrests
data <- data %>%
mutate(Weapon = str_to_upper(Weapon)) %>%
mutate(Weapon = ifelse(Weapon == "LETHAL CUTTING INSTRUMENT", "CUTTING INSTRUMENT", Weapon))
# top 3 weapons
top_weapons <- data %>%
filter(!is.na(Weapon)) %>%
count(Weapon, sort = TRUE) %>%
slice_max(n, n = 3) %>%
pull(Weapon)
# top 5 charges each weapon
top_charges <- data %>%
filter(Weapon %in% top_weapons, !is.na(Charge)) %>%
group_by(Weapon, Charge) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(Weapon) %>%
slice_max(Count, n = 5) %>%
ungroup()
top_charges$Charge <- str_wrap(top_charges$Charge, width = 20)
ggplot(top_charges, aes(x = Charge, y = Count)) +
geom_col(fill = "steelblue") +
facet_wrap(~ Weapon, scales = "free") +
labs(
title = "Top 5 Charges for the 3 Most Common Weapon Types",
x = "Charge", y = "Number of Arrests"
) +
coord_cartesian(clip = "off") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 55, hjust = 1, size = 5),
strip.text = element_text(size = 10)
)
age = Arrests$Age
age_group = c()
for (i in 1:length(age)) {
if (is.na(age[i])) {
age_group = c(age_group, "Unknown")
} else if (age[i] <= 25) {
age_group = c(age_group, "Young")
} else if (age[i] <= 45) {
age_group = c(age_group, "Middle-aged")
} else {
age_group = c(age_group, "Older")
}
}
age_group = factor(age_group, levels = c("Young", "Middle-aged", "Older", "Unknown"))
police_arrests = mutate(Arrests, Age_Group = age_group)
top_10_data = police_arrests %>%
group_by(Age_Group, Charge) %>%
summarise(Count = n()) %>%
mutate(Proportion = Count / sum(Count)) %>%
arrange(Age_Group, desc(Proportion)) %>%
mutate(Rank = min_rank(desc(Proportion))) %>%
filter(Rank <= 10)
## `summarise()` has grouped output by 'Age_Group'. You can override using the
## `.groups` argument.
ggplot(top_10_data) +
geom_bar(aes(x = reorder(Charge, -Proportion), y = Proportion, fill = Age_Group), stat = "identity") +
facet_wrap(~Age_Group, nrow = 1) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6)) +
labs(title = "Top 10 Charges by Age Group", y = "Proportion", x = "Charge")
incidents = incidents %>%
mutate(
Year = year(Report_Date)
)
arrests = Arrests %>%
mutate(
Year = year(Arrest_Date)
)
in1 = incidents %>%
filter(Year %in% 2016:2020, Offense == "SHOTS FIRED" | Offense == "GUNSHOTS")
ar1 = arrests %>%
mutate(Year = year(Arrest_Date)) %>%
filter(Year %in% 2016:2020,
Weapon %in% c("HANDGUN", "RIFLE", "SHOTGUN", "OTHER FIREARM", "FIREARM (TYP NOT STATED)"))
in1 <- in1 %>% mutate(Incident_Day = as.Date(Report_Date))
ar1 <- ar1 %>% mutate(Arrest_Day = as.Date(Arrest_Date))
ar1_same_day <- ar1 %>%
filter(Arrest_Day %in% in1$Incident_Day)
incidents_plot <- in1 %>%
transmute(
Lat = Latitude,
Long = Longitude,
Type = "Incident",
Day = Incident_Day
)
arrests_plot <- ar1_same_day %>%
transmute(
Lat = latitude,
Long = longitude,
Type = "Arrest",
Day = Arrest_Day
)
plot_data <- bind_rows(incidents_plot, arrests_plot)
shape_values <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
ggplot(plot_data, aes(x = Long, y = Lat, color = Type, shape = as.factor(Day))) +
geom_point(size = 3, alpha = 1) +
labs(
title = "Gun Incidents & Same-Day Gun Arrests",
x = "Longitude",
y = "Latitude",
color = "Type",
shape = "Date (Same-Day Pairs)",
caption = "Red = Incident, Blue = Arrest\nDifferent shapes = different days"
) +
theme_minimal() +
scale_shape_manual(values = shape_values) +
guides(
shape = guide_legend(override.aes = list(size = 4)),
color = guide_legend(override.aes = list(size = 4))
)
Update: We are no longer using the incidents data set partly because of this EDA question that revealed a few things about the nature of that data. 1. There are no incident reports for the date of the UNC shooting at the correct time 2. There is no arrest report for the date of the UNC shooting These two factors helped us to learn the incident data is a lot less comprehensive than we thought. Our theory on the shootings absence from the arrest data is that because other departments were involved CHPD didn’t report it as they didn’t make the arrest. It likely shows up in the Orange County system or maybe the State Trooper data.
#Creating Two Categories: Fall Semester and Spring Semester (when most students are on campus) and Breaks (When most students are off campus)
arrests = Arrests %>%
mutate(
Year = year(Arrest_Date),
Month = month(Arrest_Date),
Day = day(Arrest_Date),
Hour = hour(Arrest_Date)
)
Semesters = arrests %>%
filter(
(Month == 1 & Day >= 7) |
(Month %in% 2:4) |
(Month == 5 & Day <= 10) |
(Month == 8 & Day >= 15) |
(Month %in% 9:11) |
(Month == 12 & Day <= 14)
)
Breaks = arrests %>%
filter(
!((Month == 1 & Day >= 7) |
(Month %in% 2:4) |
(Month == 5 & Day <= 10) |
(Month == 8 & Day >= 15) |
(Month %in% 9:11) |
(Month == 12 & Day <= 14)))
OnBreak = arrests %>%
mutate(OnBreak = (ifelse(Object_Id %in% Breaks$Object_Id, "On Break", "Semester"))) %>%
mutate(YearMonth = as.Date(paste(Year, Month, "01", sep = "-"))) %>%
group_by(YearMonth, OnBreak) %>%
summarise(Count = n(), .groups = "drop")
ggplot(OnBreak, aes(x = YearMonth, y = Count, fill = OnBreak)) +
geom_col(position = "stack") +
labs(
title = "Monthly Arrest Counts: Semester vs. Break",
x = "Month",
y = "Number of Arrests",
fill = "Status"
) +
theme_minimal()
# Load dataset
#print(colnames(arrests))
arrests <- Arrests %>%
filter(Drugs_Alcohol %in% c("Y", "N")) %>%
mutate(
Substance = if_else(Drugs_Alcohol == "Y", "Substance Present", "No Substance"),
arrest_datetime = ymd_hms(Arrest_Date),
hour = hour(arrest_datetime),
day = day(arrest_datetime),
month = month(arrest_datetime, label = TRUE, abbr = FALSE),
weekday = wday(arrest_datetime, label = TRUE, abbr = FALSE)
)
# Clean and transform data
arrests <- arrests %>%
filter(Drugs_Alcohol %in% c("Y", "N")) %>%
mutate(
Substance = if_else(Drugs_Alcohol == "Y", "Substance Present", "No Substance"),
Arrest_Date = ymd_hms(Arrest_Date),
hour = hour(Arrest_Date)
)
# Summarize by hour
hourly_counts <- arrests %>%
count(hour, Substance) %>%
group_by(Substance) %>%
mutate(Proportion = n / sum(n))
# Plot
ggplot(hourly_counts, aes(x = hour, y = Proportion, fill = Substance)) +
geom_col(position = "dodge") +
labs(
title = "Proportion of Arrests by Hour of Day",
x = "Hour of Day (0 = Midnight, 23 = 11PM)",
y = "Proportion of Arrests",
fill = "Substance Involvement"
) +
scale_y_continuous(labels = scales::percent_format()) +
scale_x_continuous(breaks = 0:23) +
theme_minimal()
# Summarize by day of week
weekday_counts <- arrests %>%
mutate(day_of_week = wday(Arrest_Date, label = TRUE, abbr = FALSE)) %>%
count(day_of_week, Substance) %>%
group_by(Substance) %>%
mutate(Proportion = n / sum(n))
ggplot(weekday_counts, aes(x = day_of_week, y = Proportion, fill = Substance)) +
geom_col(position = "dodge") +
geom_text(
aes(label = scales::percent(Proportion, accuracy = 0.1)),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 3.5
) +
labs(
title = "Arrest Proportions by Day of Week and Substance Presence",
x = "Day of Week",
y = "Proportion of Arrests",
fill = "Substance"
) +
scale_y_continuous(labels = percent_format())
The analysis reveals a clear temporal pattern in substance-involved arrests within Chapel Hill. In particular there is a significant increase in the proportion of arrests involving substances on the days of Friday, Saturday, Sunday. Even more specific there is a substantial concentration of arrests between the times 7pm - 3am. This strongly correlates with the typical nightlife activity in the area. This indicates that law enforcement should strategize to be more proactive and aware of these elevated occurence substance related incidents during this time and provide sufficient prevention.
# Filter Drugs_Alcohol and remove unknowns
arrests <- Arrests %>%
filter(
!is.na(Gender), Gender %in% c("M", "F"), # keep only Male and Female
!is.na(Drugs_Alcohol), Drugs_Alcohol %in% c("Y", "N") # keep only valid substance values
) %>%
mutate(
Substance = ifelse(Drugs_Alcohol == "Y", "Substance Present", "No Substance")
)
### Summarize Proportions by Gender
# Filter out unknowns if needed, then summarize
q8_summary <- arrests %>%
group_by(Gender, Substance) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(Gender) %>%
mutate(Proportion = Count / sum(Count))
ggplot(q8_summary, aes(x = Gender, y = Proportion, fill = Substance)) +
geom_col(position = "dodge") +
geom_text(aes(label = scales::percent(Proportion, accuracy = 0.1)),
position = position_dodge(width = 0.9),
vjust = -0.5, size = 3.5) +
scale_fill_manual(
values = c("Substance Present" = "skyblue", "No Substance" = "red")
) +
labs(
title = "Proportion of Substance-Involved Arrests by Gender",
x = "Gender",
y = "Proportion of Arrests",
fill = "Substance Involvement"
) +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal()
This plot shows the proportion of arrests involving drugs or alcohol for each gender. The data helps us identify whether substance-involved arrests are more common among one gender relative to their total arrests in Chapel Hill. Among all arrests 66.6% of male arrests involved drugs or alcohol, compared to 54.8% of Female arrests having substance involvement. This proportion reveals that substance involvement was more prevalent among males. This difference could reflect variation the types of offenses committed or a difference in how police handle males and females that are under the influence of substances.
Substance involved arrests account for more than half of Female arrests and two thirds of Male arrests indicating that drugs or alcohol play a significant role in the majority of arrest events in Chapel Hill. This implies a strong correlation between substance presence and events leading to an arrest. These findings have implications for substance abuse prevention programs and substance abuse based police training.
Potential further questions based on this investigation. 1. Can we predict the likelihood of a substance involved arrest based on gender, time of day, age, and use this to deploy substance abuse prevention resources.
Arrests %>%
group_by(Arrest_Type, Race) %>%
summarize(n=n()) %>%
inner_join(summarize(group_by(Arrests,Race),nr=n())) %>%
mutate(prop=n/nr) %>%
ggplot() + geom_tile(aes(x=Race,y=Arrest_Type,fill=prop))
## `summarise()` has grouped output by 'Arrest_Type'. You can override using the
## `.groups` argument.
## Joining with `by = join_by(Race)`
Arrests %>%
group_by(Arrest_Type) %>%
summarize(av_lon=mean(longitude), av_lat=mean(latitude),av_dev=mean(sqrt((longitude-av_lon)^2+(latitude-av_lat)^2)))
## # A tibble: 3 × 4
## Arrest_Type av_lon av_lat av_dev
## <chr> <dbl> <dbl> <dbl>
## 1 ON VIEW -79.0 35.9 0.0225
## 2 SUMMONED/CITED -79.1 35.9 0.0176
## 3 TAKEN INTO CUSTODY (WARRANT/LP) -79.0 35.9 0.0200
type_grid = Arrests %>%
mutate(rlat=round(latitude*200)/200, rlon=round(longitude*200)/200) %>%
group_by(rlat,rlon) %>%
summarize(on_view_prop=sum(Arrest_Type=='ON VIEW')/n(), n=n()) %>%
filter(1.96*sqrt(on_view_prop*(1-on_view_prop)/n)<.15&on_view_prop>0&on_view_prop<1) # Margin in error is less than .15
## `summarise()` has grouped output by 'rlat'. You can override using the
## `.groups` argument.
ggplot() +
geom_tile(data=type_grid, aes(x=rlon, y=rlat,fill=on_view_prop)) +
#geom_point(data= arrests, aes(x=longitude, y=latitude, color=Arrest_Type), alpha = .1) +
geom_path(data = ch, aes(x=`V1`, y=`V2`)) +
coord_quickmap(xlim=c(-79.09,-78.99), ylim=c(35.86,35.99))
We decided to investigate Q3 and Q4 in further detail.
uknown_age_data = police_arrests %>%
filter(is.na(Age)) %>%
group_by(Street, latitude, longitude) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>%
head(20)
## `summarise()` has grouped output by 'Street', 'latitude'. You can override
## using the `.groups` argument.
leaflet() %>%
addTiles() %>%
setView(lng = -79.0558, lat = 35.9132, zoom = 13) %>%
addMarkers(lng = -79.0558, lat = 35.9132, popup = "Chapel Hill, NC") %>%
addCircleMarkers(lng = uknown_age_data$longitude, lat = uknown_age_data$latitude, radius = uknown_age_data$Count / 2)
police_department_data = filter(police_arrests, Street == "828 MARTIN LUTHER KING JR BLVD", Charge == "CONSUME ALCOHOLIC BEVERAGE LESS THAN 21" | str_detect(Charge, "DRUG"))
ggplot(police_department_data)+
geom_bar(aes(x = Arrest_Type))
#checking why no spring and summer in 2020
# List of all years & months in your data
all_months <- month.abb
# All year-month combinations
all_combinations <- expand_grid(
year = unique(Arrests$year),
month = all_months
)
# Actual counts
actual_counts <- Arrests %>%
count(year, month)
# Convert both month columns to character explicitly
all_combinations <- all_combinations %>%
mutate(month = as.character(month))
actual_counts <- actual_counts %>%
mutate(month = as.character(month))
# Left join + fill missing with 0
complete_counts <- all_combinations %>%
left_join(actual_counts, by = c("year", "month")) %>%
mutate(n = replace_na(n, 0)) %>%
arrange(year, match(month, all_months))
# Show missing months
missing_months <- complete_counts %>% filter(n == 0)
print(missing_months)
## # A tibble: 20 × 3
## year month n
## <dbl> <chr> <int>
## 1 2021 Apr 0
## 2 2021 May 0
## 3 2021 Jun 0
## 4 2021 Jul 0
## 5 2021 Aug 0
## 6 2021 Sep 0
## 7 2024 May 0
## 8 2024 Jun 0
## 9 NA Jan 0
## 10 NA Feb 0
## 11 NA Mar 0
## 12 NA Apr 0
## 13 NA May 0
## 14 NA Jun 0
## 15 NA Jul 0
## 16 NA Aug 0
## 17 NA Sep 0
## 18 NA Oct 0
## 19 NA Nov 0
## 20 NA Dec 0
missing_months %>%
kable("html", align = "lccrr") %>%
kable_styling(full_width = TRUE)
| year | month | n |
|---|---|---|
| 2021 | Apr | 0 |
| 2021 | May | 0 |
| 2021 | Jun | 0 |
| 2021 | Jul | 0 |
| 2021 | Aug | 0 |
| 2021 | Sep | 0 |
| 2024 | May | 0 |
| 2024 | Jun | 0 |
| NA | Jan | 0 |
| NA | Feb | 0 |
| NA | Mar | 0 |
| NA | Apr | 0 |
| NA | May | 0 |
| NA | Jun | 0 |
| NA | Jul | 0 |
| NA | Aug | 0 |
| NA | Sep | 0 |
| NA | Oct | 0 |
| NA | Nov | 0 |
| NA | Dec | 0 |
# Prepare data
daily_arrests <- Arrests %>%
filter(!is.na(ArrestDate)) %>%
count(ArrestDate) %>%
mutate(year = lubridate::year(ArrestDate))
# Get years
years <- sort(unique(daily_arrests$year))
# Loop: save PNG for each year
for (yr in years) {
cat("Rendering year:", yr, "\n")
p <- ggplot_calendar_heatmap(
daily_arrests %>% filter(year == yr),
'ArrestDate',
'n'
) +
scale_fill_gradient(low = "white", high = "red", name = "Arrests", na.value = "grey") +
labs(
title = sprintf("Calendar Heatmap of Daily Arrests — Year: %s", yr),
subtitle = "Darker colors indicate more arrests, grey = missing data"
) +
theme_minimal(base_size = 14)
# ggsave(sprintf("calendar_%s.png", yr), plot = p, width = 12, height = 8)
}
## Rendering year: 2010
## Rendering year: 2011
## Rendering year: 2012
## Rendering year: 2013
## Rendering year: 2014
## Rendering year: 2015
## Rendering year: 2016
## Rendering year: 2017
## Rendering year: 2018
## Rendering year: 2019
## Rendering year: 2020
## Rendering year: 2021
## Rendering year: 2022
## Rendering year: 2023
## Rendering year: 2024
imgs <- list.files(
path = "heat map calendar for each year",
pattern = "\\.png$",
full.names = TRUE
) |> sort()
length(imgs)
## [1] 0
output_gif <- "calendar_heatmap_animation.gif"
#gifski(
# png_files = imgs,
# gif_file = output_gif,
# delay = 2, # adjust speed (seconds per frame)
# width = 1200,
# height = 800,
# loop = TRUE
#)
Arrests numbers have generally fallen year-over-year since 2009, most drastically during the pandemic. Somewhat counterintuitively, arrest numbers are generally lower in the summer months and school breaks, despite the presence of fewer students to arrest. Arrests are concentrated in population centers, along 86 and Franklin Street and in downtown Chapel Hill, and the two relevant zip codes. Underage drinking and drug related arrests are most concentrated in downtown Chapel Hill and Franklin Street, as well as at the police department. DWI is a common charge, especially for the young, while open container and failure to appear charges surprisingly inscrease in frequency with age group. Police data on reported incidents turned out to have very little relationship with the arrest data, with only 24 same-day pairs of gun-related incident reports and gun-related arrests in the entire dataset. Substance involved arrests are, predictably, more likely late at night and on weekends, and offenders are more likely to be men than in other arrests. Arrests of racially Hispanic and Asian offenders are more likely to be on-site arrests, while black offenders were likely to be taken into custody. On-view arrests were geographically well-distributed but much less common on the UNC campus.
To explore these patterns further, we investigated whether the “Unknown” age group may include minors and why there was a high concentration of drug and alcohol charges at the Chapel Hill Police Department’s address. We found that the “Unknown” group had a high number of underage drinking and marijuana-related charges, and these records often lacked details like age, race, and gender. Many of these arrests also occurred near schools or the police department, suggesting that they may involve minors whose information is legally protected. Additionally, examining arrests at the police department revealed that most were labeled as “Summoned/Cited” rather than physical detainments. This implies that the address may reflect the location where citations were processed, not necessarily where offenses occurred. Together, these findings suggest that a combination of data protection practices and recording conventions could explain both the anonymity in age and the arrest clustering at the police station. From the first plot investigating question 4, the sharp decline in arrests in 2021 coincides with missing data for several months of that year. According to the missing-months table, arrests were not recorded (or records are missing) for months such as April–September 2021. Therefore, the drop is at least partially attributable to incomplete data rather than a real drop in arrest activity. And from the animated heatmap, We can see the trends in arrests over the previous 15 years by looking at the calendar heatmap GIF, which cycles through all years and months in a chronological manner. The heatmaps’ color saturation shows the number of arrests; more arrests are indicated by darker red. Most months have at least some activity over the years, but the red’s intensity varies, indicating variations in the number of arrests over time. Many months appear grey, indicating missing data for those periods, especially during COVID years. The dramatic decline in annual arrests during the trend years is explained by this visual evidence. The animation illustrates the annual and seasonal trends in arrests while also drawing attention to the data gaps